We report a method for estimating what percentage of people who cited a 
paper had actually read it. The method is based on a stochastic model 
of the citation process that explains empirical studies of misprint 
distributions in citations (which we show follow a Zipf law). Our estimate is 
that only about 20% of citers read the original. 


Many psychological tests have a so-called “lie-scale.” A small but 
sufficient number of questions that admit only one true answer, such as 
“Do you always reply to letters immediately after reading them?”, are 
inserted among others that are central to the particular test. A wrong 
reply to such a question adds a point on the lie-scale, and when the 
lie-score is high, the overall test results are discarded as unreliable. 
Perhaps, for a scientist, the best candidate for such a lie-scale is the 
question: “Do you read all of the papers that you cite?” 

Comparative studies of the popularity of scientific papers have been a 
subject of much recent interest [1-8], but the scope has been limited to 
citation distribution analysis. We have discovered a method of estimating 
what percentage of people who cited a paper had actually read it. 
Remarkably, this can be achieved without any testing of the scientists, 
solely on the basis of the information available in the ISI citation 
database. 

Freud [9] discovered that the application of his technique of 
psychoanalysis to slips in speech and writing could reveal a lot of hidden 
information about human psychology. Similarly, we find that the 
application of statistical analysis to misprints in scientific citations can give 
an insight into the process of scientific writing. As in the Freudian case, 
the truth revealed is embarrassing. For example, an interesting statistic 
revealed in our study is that a lot of misprints are identical. Consider, 
for example, a four-digit page number with one digit misprinted. There 
can be $10^4$ such misprints. The probability of repeating someone else’s 
misprint accidentally is $10^{-4}$. There should be almost no repeat 
misprints by coincidence. One concludes that repeat misprints are due to 
copying someone else’s reference, without reading the paper in question. 

Figure 1. Rank-frequency distribution of misprints referencing a paper, which 
had acquired 4300 citations. There are 196 misprints total, out of which 45 are 
distinct. The most popular misprint propagated 78 times. A good fit to Zipf’s 
law is evident. 

In principle, one can argue that an author might copy a citation 
from an unreliable reference list, but still read the paper. A modest 
reflection would convince one that this is relatively rare, and cannot 
apply to the majority. Surely, in the pre-Internet era it took almost 
as much effort to copy a reference as to type in one’s own based on the 
original, thus providing little incentive to copy if someone had indeed 
read, or at the very least had procured access to, the original. Moreover, 
if someone accesses the original by tracing it from the reference list 
of a paper with a misprint, then with high likelihood the misprint 
is identified and not propagated. In the past decade, with 
the advent of the Internet, it has become equally convenient for would-be 
nonreaders to copy from unreliable sources and for would-be readers to 
access the original. But there is no increased 
incentive for those who read the original to also make verbatim copies, 
especially from unreliable resources.¹ In the rest of this paper, giving 
the benefit of the doubt to potential nonreaders, we adopt a much more 
generous view of a “reader” of a cited paper as someone who at the 
very least consulted a trusted source (e.g., the original paper or heavily 
used and authenticated databases) in putting together the citation list. 
As misprints in citations are not too frequent, only celebrated papers 
provide enough statistics to work with. Figure 1 shows a distribution 
of misprints in citations to one such paper [10] in the rank-frequency 
representation, introduced by Zipf [11]. The most popular misprint in 
a page number propagated 78 times. Figure 2 shows the same data, but 
in a number-frequency format. 

¹ According to many researchers, the Internet may end up even aggravating the copying 
problem: more users are copying second-hand material without verifying or referring to 
the original sources. 

Figure 2. Same data as in Figure 1, but in the number-frequency representation. 
Misprints follow a power-law distribution with exponent close to 2. 

As a preliminary attempt, one can estimate an upper bound on the 
ratio of the number of readers to the number of citers, R, as the ratio of 
the number of distinct misprints D to the total number of misprints T. 
Clearly, among T citers, T − D copied, because they repeated someone 
else’s misprint. For the D others, with the information at hand, we 
have no evidence that they did not read, so according to the presumption 
of innocence, we assume that they did. Then in our sample, we 
have D readers and T citers, which leads to: 

$$R \approx \frac{D}{T}. \tag{1}$$

Substituting D = 45 and T = 196 in equation (1), we obtain 
R ≈ 0.23. This estimate would be correct if the people who introduced 
original misprints had always read the original paper. However, given 
the low value of the upper bound on R, it is obvious that many original 
misprints were introduced while copying references. Therefore, a more 
careful analysis is necessary. We need a model to accomplish it. 

Our model for misprint propagation, which was stimulated by 
Simon’s explanation of the Zipf law [12] and the idea of link redirection 
by Krapivsky and Redner [4], is as follows. Each new citer finds the 
reference to the original in any of the papers that already cite it. With 
probability R he reads the original. With probability 1 − R he copies the 
citation from the paper he found it in. In any case, with probability M 
he introduces a new misprint. 
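
The process just described can be simulated directly. The following is a minimal sketch rather than the code used for the analysis in this paper: the function name `simulate_citations`, the labelling of citations by integers (0 for a correct citation, a positive integer for each distinct misprint), and the assumption that the first citation is correct are all illustrative choices.

```python
import random
from collections import Counter


def simulate_citations(n_citations, read_prob, misprint_prob, seed=0):
    """Simulate the copying model; return a Counter {misprint_id: occurrences}.

    Label 0 marks a correct citation; each new misprint gets a fresh id.
    Assumption: the very first citation is correct.
    """
    rng = random.Random(seed)
    citations = [0]
    next_id = 1
    for _ in range(n_citations - 1):
        # The new citer finds the reference in a random earlier citing paper.
        source_label = rng.choice(citations)
        if rng.random() < read_prob:
            label = 0             # read the original: citation starts out correct
        else:
            label = source_label  # copied: inherits whatever the source printed
        if rng.random() < misprint_prob:
            label = next_id       # in any case, a new misprint may be introduced
            next_id += 1
        citations.append(label)
    return Counter(label for label in citations if label != 0)


counts = simulate_citations(4300, read_prob=0.2, misprint_prob=45 / 4300)
total_misprints = sum(counts.values())   # T in the text
distinct_misprints = len(counts)         # D in the text
ranked = sorted(counts.values(), reverse=True)  # rank-frequency data
print(total_misprints, distinct_misprints, ranked[:5])
```

Sorting the counts in decreasing order gives the rank-frequency representation used in Figure 1. A single run is noisy, so any comparison with the data should average over many seeds.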
The evolution of the misprint distribution (here $N_K$ denotes the 
number of misprints that propagated K times, and N is the total number of 
citations) is described by the following rate equations: 


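$$
\begin{aligned}
\frac{dN_1}{dN} &= M - (1-R)\times(1-M)\times\frac{N_1}{N}, \\
\frac{dN_K}{dN} &= (1-R)\times(1-M)\times\frac{(K-1)\times N_{K-1} - K\times N_K}{N} \quad (K > 1).
\end{aligned}
\tag{2}
$$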


These equations can be easily solved using methods developed in [4] to 
get: 
$$N_K \sim \frac{1}{K^{\gamma}}; \qquad \gamma = 1 + \frac{1}{(1-R)\times(1-M)}. \tag{3}$$


As the exponent of the number-frequency distribution $\gamma$ is related to the 
exponent of the rank-frequency distribution $\alpha$ by the relation $\gamma = 1 + (1/\alpha)$, 
equation (3) implies that: 

$$\alpha = (1-R)\times(1-M). \tag{4}$$


The rate equation for the total number of misprints is: 


$$\frac{dT}{dN} = M + (1-R)\times(1-M)\times\frac{T}{N}. \tag{5}$$

The stationary solution of equation (5) is: 

$$T = N\times\frac{M}{R + M - M\times R}. \tag{6}$$


The expectation value for the number of distinct misprints is obviously 


$$D = N\times M. \tag{7}$$

From equations (6) and (7) we obtain: 

$$R = \frac{D}{T}\times\frac{N-T}{N-D}. \tag{8}$$


Substituting D = 45, T = 196, and N = 4300 in equation (8), we obtain 
R ≈ 0.22, which is very close to the initial estimate obtained using 
equation (1). This low value of R is consistent with the “Principle of 
Least Effort” [11]. 
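
The two estimates can be checked with a few lines of arithmetic. This is a sketch only; the function names are illustrative, and the inputs are the values quoted for Figure 1 (D = 45, T = 196, N = 4300).

```python
def upper_bound_estimate(distinct, total):
    """Equation (1): R is approximately D / T."""
    return distinct / total


def stationary_estimate(distinct, total, citations):
    """Equation (8): R = (D / T) * (N - T) / (N - D)."""
    return (distinct / total) * (citations - total) / (citations - distinct)


D, T, N = 45, 196, 4300
print(round(upper_bound_estimate(D, T), 2))    # 0.23
print(round(stationary_estimate(D, T, N), 2))  # 0.22
```

The correction factor (N − T)/(N − D) is close to 1 whenever misprints are a small fraction of all citations, which is why the two numbers nearly coincide.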

One can ask: Why did we not choose to extract R using equations (3) 
or (4)? This is because $\alpha$ and $\gamma$ are not very sensitive to R when it is 
small. In contrast, T scales as 1/R. 
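
To make the sensitivity argument concrete, one can tabulate the exponents and the stationary T for a few values of R. The sketch below uses equations (3), (4), and (6); the function names are illustrative, and M = 45/4300 is taken from equation (7) applied to the data of Figure 1.

```python
def exponents(read_prob, misprint_prob):
    """Equations (4) and (3): return (alpha, gamma)."""
    alpha = (1 - read_prob) * (1 - misprint_prob)
    gamma = 1 + 1 / alpha
    return alpha, gamma


def stationary_total(citations, read_prob, misprint_prob):
    """Equation (6): stationary total number of misprints T."""
    x = read_prob + misprint_prob - misprint_prob * read_prob
    return citations * misprint_prob / x


N = 4300
M = 45 / N  # assumption: M estimated from D = N * M
for R in (0.1, 0.2, 0.3):
    alpha, gamma = exponents(R, M)
    T = stationary_total(N, R, M)
    print(f"R={R:.1f}  alpha={alpha:.3f}  gamma={gamma:.3f}  T={T:.0f}")
```

Doubling R from 0.1 to 0.2 changes the exponents by roughly ten percent or less, while T is nearly halved, so T is the more informative observable.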

We can slightly modify our model and assume that original misprints 
are only introduced when the reference is derived from the original 
paper, while those who copy references do not introduce new misprints 
(e.g., they cut-and-paste). In this case one can show that T = N × M 
and D = N × M × R. As a consequence, equation (1) becomes exact (in 
terms of expectation values, of course). 

The preceding analysis assumes that the stationary state had been 
reached. Is this reasonable? Equation (5) can be rewritten as: 


$$\frac{d\left(\frac{T}{N}\right)}{M - \left(\frac{T}{N}\right)\times(R + M - M\times R)} = d\ln N. \tag{9}$$
As long as M is small it is natural to assume that the first citation was 
correct. Then the initial condition is N = 1; T = 0. Equation (9) can be 
solved to get: 


$$T = N\times\frac{M}{R + M - M\times R}\times\left(1 - \frac{1}{N^{R + M - M\times R}}\right). \tag{10}$$


This should be solved numerically for R. For our guinea pig, equa- 
tion (10) gives R = 0.17. 
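
One way to carry out this numerical solution is bisection on R. The sketch below is illustrative rather than the procedure actually used here: it assumes M = D/N from equation (7), takes T = 196 from Figure 1, and relies on the fact that the right-hand side of equation (10) decreases monotonically with R.

```python
def expected_total(citations, read_prob, misprint_prob):
    """Equation (10): expected T with initial condition N = 1, T = 0."""
    x = read_prob + misprint_prob - misprint_prob * read_prob
    return citations * misprint_prob / x * (1 - citations ** (-x))


def solve_read_prob(distinct, total, citations, tol=1e-10):
    """Solve equation (10) for R by bisection, with M = D / N assumed."""
    misprint_prob = distinct / citations
    lo, hi = 1e-9, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # expected_total falls as R grows, so a too-large T means R is too small.
        if expected_total(citations, mid, misprint_prob) > total:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(round(solve_read_prob(45, 196, 4300), 2))  # 0.17
```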
Just as a cautionary note, equation (10) can be rewritten as: 


$$\frac{T}{D} = \frac{1}{x}\times\left(1 - \frac{1}{N^{x}}\right); \qquad x = R + M - M\times R. \tag{11}$$


The definition of the natural logarithm is: 


$$\ln a = \lim_{x \to 0} \frac{a^{x} - 1}{x}.$$





Comparing this with equation (11) we see that when R is small (M is 
obviously always small): 


$$\frac{T}{D} \approx \ln N. \tag{12}$$


This means that a naive analysis using equations (1) or (8) can lead to 
an erroneous belief that more cited papers are less read. 
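
The effect can be illustrated numerically. In the sketch below, the true R is held fixed at a small value while N grows, and the naive estimate D/T implied by equation (10) with D = N × M is evaluated. The function name and the parameter values R = 0.01 and M = 0.001 are purely illustrative assumptions.

```python
import math


def naive_estimate(citations, read_prob, misprint_prob):
    """D / T implied by equation (10), using D = N * M."""
    x = read_prob + misprint_prob - misprint_prob * read_prob
    # 1 - N**(-x), computed via expm1 for accuracy when x is tiny.
    return x / -math.expm1(-x * math.log(citations))


R, M = 0.01, 0.001  # illustrative values, not fitted to any data
for N in (100, 1000, 10000):
    print(N, round(naive_estimate(N, R, M), 3), round(1 / math.log(N), 3))
```

The naive estimate falls as N grows and tracks 1/ln N, even though R is the same in every row.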

One can augment our results with a closer scrutiny of the data. In 
order to make sure that misprints have not been introduced by the ISI, 
as sometimes happens [13], we explicitly verified a dozen misprinted 
citations in the original articles. All of them were exactly as in the ISI 
database. There are also occasional repeat identical misprints in papers 
that share individuals in their author lists. Such events constitute a 
minority of repeat misprints. It is not obvious what to do with such 
cases when the author lists are not identical: Should the set of citations 
be counted as a single occurrence (under the premise that the common 
co-author is the only source of the misprint), or as multiple repetitions? 
Even if we count all such repetitions as only a single misprint 
occurrence, the number of citation-copiers (i.e., T − D) drops 
from 151 to 112, bringing the upper bound for R (equation (1)) from 
23% up to 29%. Nevertheless, a more detailed analysis via our model 
[14] brings the estimate down closer to 20%, keeping the original 
conclusions unaltered. 
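
The revised upper bound follows from equation (1) once T is recomputed as D plus the reduced number of copiers. A short check, with illustrative variable names:

```python
distinct = 45                             # D
copiers_before, copiers_after = 151, 112  # T - D before and after merging shared-author repeats

bound_before = distinct / (distinct + copiers_before)  # D / T with T = 196
bound_after = distinct / (distinct + copiers_after)    # D / T with T = 157

print(round(bound_before, 2), round(bound_after, 2))  # 0.23 0.29
```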

We conclude that misprints in scientific citations should not be dis- 
carded as a mere happenstance, but, similar to Freudian slips, analyzed. 